121 research outputs found

    Whole Genome Sequence of a Turkish Individual

    Although whole human genome sequencing can be done with readily available technical and financial resources, the need for detailed analyses of genomes of certain populations still exists. Here we present, for the first time, sequencing and analysis of a Turkish human genome. We have performed 35x coverage using paired-end sequencing, where over 95% of sequencing reads are mapped to the reference genome covering more than 99% of the bases. The assembly of unmapped reads rendered 11,654 contigs, 2,168 of which did not reveal any homology to known sequences, resulting in ~1 Mbp of unmapped sequence. Single nucleotide polymorphism (SNP) discovery resulted in 3,537,794 SNP calls with 29,184 SNPs identified in coding regions, where 106 were nonsense and 259 were categorized as having a high-impact effect. The homozygosity/heterozygosity ratio (1,415,123:2,122,671 or 1:1.5) and the transition/transversion ratio (2,383,204:1,154,590 or 2.06:1) were within expected limits. Of the identified SNPs, 480,396 were potentially novel with 2,925 in coding regions, including 48 nonsense and 95 high-impact SNPs. Functional analysis of novel high-impact SNPs revealed various interaction networks, notably involving hereditary and neurological disorders or diseases. Assembly results indicated 713,640 indels (1:1.09 insertion/deletion ratio), ranging from −52 bp to 34 bp in length and causing about 180 codon insertion/deletions and 246 frame shifts. Using paired-end- and read-depth-based methods, we discovered 9,109 structural variants and compared our variant findings with other populations. Our results suggest that whole genome sequencing is a valuable tool for understanding variations in the human genome across different populations. Detailed analyses of genomes of diverse origins greatly benefit research in genetics and medicine and should be conducted on a larger scale.
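
The two ratios quoted in this abstract can be rechecked from the reported counts. The following is a minimal sketch, not code from the paper: the helper `is_transition` and all variable names are illustrative assumptions, and only the five counts come from the abstract.

```python
# Counts as reported in the abstract above.
TOTAL_SNPS = 3_537_794
HOMOZYGOUS = 1_415_123
HETEROZYGOUS = 2_122_671
TRANSITIONS = 2_383_204
TRANSVERSIONS = 1_154_590

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref: str, alt: str) -> bool:
    """Return True for purine<->purine or pyrimidine<->pyrimidine changes.

    Hypothetical helper: it only illustrates how a single SNP would be
    classified before counting transitions and transversions.
    """
    ref, alt = ref.upper(), alt.upper()
    if ref == alt:
        raise ValueError("ref and alt must differ for a SNP")
    return {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES


# Both partitions should account for every SNP call.
assert HOMOZYGOUS + HETEROZYGOUS == TOTAL_SNPS
assert TRANSITIONS + TRANSVERSIONS == TOTAL_SNPS

ts_tv = TRANSITIONS / TRANSVERSIONS   # transition/transversion ratio
het_hom = HETEROZYGOUS / HOMOZYGOUS   # heterozygous calls per homozygous call

print(f"Ts/Tv   = {ts_tv:.2f}:1")
print(f"hom:het = 1:{het_hom:.1f}")
```

Both partitions sum to the total number of SNP calls, and the recomputed ratios round to the 2.06:1 and 1:1.5 values given in the abstract.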

    Prediction of peptides binding to MHC class I and II alleles by temporal motif mining

    Background: MHC (Major Histocompatibility Complex) is a key player in the immune response of most vertebrates. The computational prediction of whether a given antigenic peptide will bind to a specific MHC allele is important in the development of vaccines for emerging pathogens, the creation of possibilities for controlling immune response, and the applications of immunotherapy. One of the problems that make this computational prediction difficult is the detection of the binding core region in peptides, coupled with the presence of bulges and loops causing variations in the total sequence length. Most machine learning methods require the sequences to be of the same length to successfully discover the binding motifs, ignoring the length variance in both motif mining and prediction steps. In order to overcome this limitation, we propose the use of time-based motif mining methods that work position-independently. Results: The prediction method was tested on a benchmark set of 28 different alleles for MHC class I and 27 different alleles for MHC class II. The obtained results are comparable to the state-of-the-art methods for both MHC classes, surpassing the published results for some alleles. The average prediction AUC values are 0.897 for class I and 0.858 for class II. Conclusions: Temporal motif mining using partial periodic patterns can capture information about the sequences well enough to predict the binding of the peptides and is comparable to state-of-the-art methods in the literature. Unlike neural networks or matrix-based predictors, our proposed method does not depend on peptide length and can work with both short and long fragments. This advantage allows better use of the available training data and the prediction of peptides of uncommon lengths.
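
For readers unfamiliar with the AUC values reported above, here is a minimal sketch of how an AUC can be computed from prediction scores. It uses the pairwise (Mann-Whitney) formulation and is not the authors' evaluation code; the score lists are made-up illustrative values, not data from the paper.

```python
def auc(scores_pos, scores_neg):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive (binder)
    scores higher than a randomly chosen negative (non-binder);
    ties count as half a win.
    """
    if not scores_pos or not scores_neg:
        raise ValueError("need at least one positive and one negative score")
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Hypothetical predictor outputs for three binders and three non-binders.
binders = [0.9, 0.8, 0.4]
non_binders = [0.7, 0.3, 0.2]

example_auc = auc(binders, non_binders)
print(f"AUC = {example_auc:.3f}")
```

In this toy case, eight of the nine binder/non-binder pairs are ranked correctly, so the AUC is 8/9. A value of 0.5 corresponds to random ranking and 1.0 to perfect separation.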

    FQStat: a parallel architecture for very high-speed assessment of sequencing quality metrics

    Background: High-throughput DNA/RNA sequencing has revolutionized biological and clinical research. Sequencing is widely used and generates very large amounts of data, mainly due to reduced cost and advanced technologies. Quickly assessing the quality of giga-to-tera base levels of sequencing data has become a routine but important task. Identification and elimination of low-quality sequence data is crucial for the reliability of downstream analysis results. There is a need for a high-speed tool that uses optimized parallel programming for batch processing and simply gauges the quality of sequencing data from multiple datasets independent of any other processing steps. Results: FQStat is a stand-alone, platform-independent software tool that assesses the quality of FASTQ files using parallel programming. Based on the machine architecture and input data, FQStat automatically determines the number of cores and the amount of memory to be allocated per file for optimum performance. Our results indicate that in a core-limited case, core assignment overhead exceeds the benefit of additional cores. In a core-unlimited case, performance reaches a saturation point as additional cores are assigned per file. We also show that memory allocation per file has a lower priority in performance when compared to the allocation of cores. FQStat’s output is summarized in HTML web page, tab-delimited text file, and high-resolution image formats. FQStat calculates and plots read count, read length, quality score, and high-quality base statistics. FQStat identifies and marks low-quality sequencing data to suggest removal from downstream analysis. We applied FQStat on real sequencing data to optimize performance and to demonstrate its capabilities. We also compared FQStat’s performance to similar quality control (QC) tools that utilize parallel programming and attained improvements in run time. Conclusions: FQStat is a user-friendly tool with a graphical interface that employs a parallel programming architecture and automatically optimizes its performance to generate quality control statistics for sequencing data. Unlike existing tools, these statistics are calculated for multiple datasets and separately at the “lane,” “sample,” and “experiment” level to identify subsets of the samples with low quality, thereby preventing the loss of complete samples when reliable data can still be obtained. Includes 6 supplemental files.
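
To make the statistics named in this abstract concrete (read count, read length, quality score, high-quality bases), the sketch below computes them for a single FASTQ stream. It is not FQStat's implementation: the function name `fastq_stats`, the Phred+33 encoding, and the Q30 threshold for a "high-quality" base are all assumptions chosen for illustration.

```python
import io


def fastq_stats(handle, min_q=30, offset=33):
    """Summarize a FASTQ stream of 4-line records.

    Assumptions (not taken from the FQStat paper): qualities are
    Phred+33 encoded and a base is "high quality" when Q >= min_q.
    """
    n_reads = total_bases = total_q = hq_bases = 0
    while True:
        header = handle.readline()
        if not header:            # end of file
            break
        if not header.strip():    # tolerate blank lines between records
            continue
        seq = handle.readline().rstrip("\r\n")
        handle.readline()         # '+' separator line, ignored
        qual = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError(f"malformed FASTQ record: {header.strip()!r}")
        scores = [ord(c) - offset for c in qual]
        n_reads += 1
        total_bases += len(seq)
        total_q += sum(scores)
        hq_bases += sum(1 for q in scores if q >= min_q)
    if total_bases == 0:
        raise ValueError("no bases found in input")
    return {
        "reads": n_reads,
        "mean_length": total_bases / n_reads,
        "mean_quality": total_q / total_bases,
        "hq_fraction": hq_bases / total_bases,
    }


# Two made-up reads: 'I' encodes Q40 and '!' encodes Q0 under Phred+33.
example = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nAC\n+\n!!\n")
stats = fastq_stats(example)
print(stats)
```

Because each file is summarized independently, a function like this could be mapped over many files with a process pool. How many cores to assign per file is the tuning question the abstract describes, and it is not addressed by this sketch.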

    Bayesian network prior: network analysis of biological data using external knowledge

    Motivation: Reverse engineering gene interaction (GI) networks from experimental data is a challenging task due to the complex nature of the networks and the noise inherent in the data. One way to overcome these hurdles would be to incorporate the vast amounts of external biological knowledge when building interaction networks. We propose a framework where GI networks are learned from experimental data using Bayesian networks (BNs) and the incorporation of external knowledge is also done via a BN that we call Bayesian Network Prior (BNP). BNP depicts the relation between various evidence types that contribute to the event ‘gene interaction’ and is used to calculate the probability of a candidate graph (G) in the structure learning process. Results: Our simulation results on synthetic, simulated and real biological data show that the proposed approach can identify the underlying interaction network with high accuracy even when the prior information is distorted, and that it outperforms existing methods.

    Human transcriptome corresponding to human oocytes and use of said genes or the corresponding polypeptides to trans-differentiate somatic cells

    The identification of 101 genes upregulated or differentially expressed by mature human oocytes is provided herein. These genes and the corresponding gene products will facilitate a greater understanding of oogenesis, folliculogenesis, fertilization, and embryonic development. In addition, these genes and the corresponding gene products can be used to effect dedifferentiation and/or transdifferentiation of desired somatic cells. The resultant dedifferentiated cells and somatic cells derived therefrom can be used in cell therapies such as in the treatment of cancer, autoimmunity, and other diseases wherein specific types of cells such as hematopoietic cells may be depleted because of the underlying disease or the treatment of the disease. Also, a core group of 66 transcripts was identified by intersecting significantly up-regulated genes of the human oocyte with those from the mouse oocyte and from human and mouse embryonic stem cells. Within the up-regulated probe sets, the top overrepresented categories were related to RNA and protein metabolism, followed by DNA metabolism and chromatin modification. This invention therefore provides a comprehensive expression baseline of genes expressed in in vivo matured human oocytes. Further understanding of the biological role of these genes will also expand knowledge on meiotic cell cycle, fertilization, chromatin remodeling, lineage commitment, pluripotency, tissue regeneration, and morphogenesis.

    Gene expression analysis of embryonic stem cells expressing VE-cadherin (CD144) during endothelial differentiation

    Background: Endothelial differentiation occurs during normal vascular development in the developing embryo. This process is recapitulated in the adult when endothelial progenitor cells are generated in the bone marrow and can contribute to vascular repair or angiogenesis at sites of vascular injury or ischemia. The molecular mechanisms of endothelial differentiation remain incompletely understood. Novel approaches are needed to identify the factors that regulate endothelial differentiation. Methods: Mouse embryonic stem (ES) cells were used to further define the molecular mechanisms of endothelial differentiation. By flow cytometry, a population of VEGF-R2 positive cells was identified as early as 2.5 days after differentiation of ES cells, and a subset of VEGF-R2+ cells that were CD41 positive was identified at 3.5 days. A separate population of VEGF-R2+ stem cells expressing the endothelial-specific marker CD144 (VE-cadherin) was also identified at this same time point. Channels lined by VE-cadherin positive cells developed within the embryoid bodies (EBs) formed by differentiating ES cells. VE-cadherin and CD41 expressing cells differentiate in close proximity to each other within the EBs, supporting the concept of a common origin for cells of hematopoietic and endothelial lineages. Results: Microarray analysis of >45,000 transcripts was performed on RNA obtained from cells expressing VEGF-R2+, CD41+, and CD144+ and VEGF-R2-, CD41-, and CD144-. All microarray experiments were performed in duplicate using RNA obtained from independent experiments, for each subset of cells. Expression profiling confirmed the role of several genes involved in hematopoiesis, and identified several putative genes involved in endothelial differentiation. Conclusion: The isolation of CD144+ cells during ES cell differentiation from embryoid bodies provides an excellent model system and method for identifying genes that are expressed during endothelial differentiation and that are distinct from hematopoiesis.

    Comparative analysis of single-cell transcriptomics in human and Zebrafish oocytes

    Background: Zebrafish is a popular model organism that is widely used in developmental biology research. Despite its general use, the direct comparison of the zebrafish and human oocyte transcriptomes has not been well studied. It is important to determine whether the similarity observed between the two organisms at the gene sequence level is also observed at the expression level in key cell types such as the oocyte. Results: We performed single-cell RNA-seq of the zebrafish oocyte and compared it with two studies that have performed single-cell RNA-seq of the human oocyte. We carried out a comparative analysis of genes expressed in the oocyte and genes highly expressed in the oocyte across the three studies. Overall, we found high consistency between the human studies and high concordance in expression for the orthologous genes in the two organisms. According to the Ensembl database, about 60% of the human protein-coding genes are orthologous to zebrafish genes. Our results showed that a higher percentage of the genes that are highly expressed in both organisms show orthology compared to the genes with lower expression. Systems biology analysis of the genes highly expressed in the three studies showed significant overlap of the enriched pathways and GO terms. Moreover, orthologous genes that are commonly overexpressed in both organisms were involved in biological mechanisms that are functionally essential to the oocyte. Conclusions: Orthologous genes are concurrently highly expressed in the oocytes of the two organisms and these genes belong to similar functional categories. Our results provide evidence that zebrafish could serve as a valid model organism to study the oocyte, with direct implications for humans.

    Testing robustness of relative complexity measure method constructing robust phylogenetic trees for Galanthus L. Using the relative complexity measure

    Background: Most phylogeny analysis methods based on molecular sequences use multiple sequence alignment (MSA), where the quality of the alignment, which depends on the alignment parameters, determines the accuracy of the resulting trees. Different parameter combinations chosen for the multiple alignment may result in different phylogenies. A new non-alignment-based approach, Relative Complexity Measure (RCM), has been introduced to tackle this problem and has been shown to work in fungi and mitochondrial DNA. Results: In this work, we present an application of the RCM method to reconstruct robust phylogenetic trees using sequence data for genus Galanthus obtained from different regions in Turkey. Phylogenies have been analyzed using nuclear and chloroplast DNA sequences. Results showed that the tree obtained from nuclear ribosomal RNA gene sequences was more robust, while the tree obtained from the chloroplast DNA showed a higher degree of variation. Conclusions: Phylogenies generated by Relative Complexity Measure were found to be robust and results of RCM were more reliable than the compared techniques. In particular, RCM appears to be a reasonable way to overcome MSA-based problems and a good alternative to MSA-based phylogenetic analysis. We believe our method will become a mainstream phylogeny construction method, especially for highly variable sequence families where the accuracy of the MSA heavily depends on the alignment parameters.

    Bioinformatic identification and characterization of human endothelial cell-restricted genes

    Background: In this study, we used a systematic bioinformatics analysis approach to elucidate genes that exhibit an endothelial cell (EC) restricted expression pattern, and began to define their regulation, tissue distribution, and potential biological role. Results: Using a high throughput microarray platform, a primary set of 1,191 transcripts that are enriched in different primary ECs compared to non-ECs was identified (LCB >3, FDR <2%). Further refinement of this initial subset of transcripts, using published data, yielded 152 transcripts (representing 109 genes) with different degrees of EC-specificity. Several interesting patterns emerged among these genes: some were expressed in all ECs and several were restricted to microvascular ECs. Pathway analysis and gene ontology demonstrated that several of the identified genes are known to be involved in vasculature development, angiogenesis, and endothelial function (P < 0.01). These genes are enriched in cardiovascular diseases, hemorrhage and ischemia gene sets (P < 0.001). Most of the identified genes are ubiquitously expressed in many different tissues. Analysis of the proximal promoter revealed the enrichment of conserved binding sites for 26 different transcription factors and analysis of the untranslated regions suggests that a subset of the EC-restricted genes are targets of 15 microRNAs. While many of the identified genes are known for their regulatory role in ECs, we have also identified several novel EC-restricted genes, the functions of which have yet to be fully defined. Conclusion: The study provides an initial catalogue of EC-restricted genes, most of which are ubiquitously expressed in different endothelial cells.